This project explores each variable to find out which ones have the strongest influence on the quality of red wine.
setwd("~/Downloads")
wine <- read.csv('wineQualityReds.csv')
dim(wine)
## [1] 1599 13
names(wine)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
str(wine)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(wine)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
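As the output above shows, the dataset has 1599 observations and 13 columns: the row index X plus 12 wine variables (11 physicochemical measurements and quality).
I will now look at the distribution of each of the 12 variables.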
library(ggplot2)
ggplot(aes(x=fixed.acidity), data=wine)+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x=fixed.acidity), data=wine)+
geom_histogram(bins = 50)
summary(wine$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
table(wine$fixed.acidity)
##
## 4.6 4.7 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1
## 1 1 1 6 4 6 4 5 1 14 2 4 9 13 16
## 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.5 7.6
## 20 14 25 17 37 28 46 38 50 57 67 44 44 52 46
## 7.7 7.8 7.9 8 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9 9.1
## 49 53 42 42 26 45 40 26 19 27 24 34 33 26 29
## 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10 10.1 10.2 10.3 10.4 10.5 10.6
## 16 22 17 14 17 9 15 26 23 10 19 11 21 12 14
## 10.7 10.8 10.9 11 11.1 11.2 11.3 11.4 11.5 11.6 11.7 11.8 11.9 12 12.1
## 10 10 8 3 9 5 7 5 13 12 3 3 12 7 1
## 12.2 12.3 12.4 12.5 12.6 12.7 12.8 12.9 13 13.2 13.3 13.4 13.5 13.7 13.8
## 4 5 4 7 4 4 5 2 3 3 3 1 1 2 1
## 14 14.3 15 15.5 15.6 15.9
## 1 1 2 2 2 1
The histogram of fixed.acidity is approximately normal with a peak between 7 and 8 and a slight right skew (the mean, 8.32, is above the median, 7.90). There are suspected outliers on the right, and I should consider whether to exclude them.
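One common way to make the outlier question concrete is Tukey's 1.5 x IQR rule. The sketch below applies it to the quartiles reported by summary(wine$fixed.acidity) above. The rule is an assumption on my part rather than something this analysis commits to, and the variable names are illustrative.

```r
# Quartiles copied from the summary(wine$fixed.acidity) output above.
q1 <- 7.10
q3 <- 9.20

# Tukey's rule (an assumed convention): points more than 1.5 * IQR
# beyond the quartiles are flagged as potential outliers.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

cat("IQR:", iqr, "\n")
cat("Fences:", lower_fence, "to", upper_fence, "\n")
```

With the data loaded, boxplot.stats(wine$fixed.acidity)$out lists the values flagged by a closely related rule based on hinges.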
ggplot(aes(x=volatile.acidity), data = wine)+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x=(volatile.acidity^(1/3))), data = wine)+
geom_histogram(bins = 30)
summary(wine$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
summary(wine$volatile.acidity^(1/3))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4932 0.7306 0.8041 0.7977 0.8618 1.1647
After the cube-root transformation, the histogram of volatile.acidity^(1/3) is approximately normal with a peak around 0.85. There are suspected outliers on the right, and I should consider whether to exclude them.
ggplot(aes(x=citric.acid), data = wine)+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x=(sqrt(citric.acid))), data = wine)+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(wine$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The histogram of sqrt(citric.acid) is roughly bell-shaped with peaks around 0.5 and 0.75. There is another spike at 0: because sqrt(0) = 0, the transformation leaves the zero values where they were.
table(wine$citric.acid)
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
There are 132 wines that have 0 citric acid.
ggplot(aes(x=residual.sugar), data = wine)+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x=(log10(residual.sugar))), data = wine)+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(wine$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The histogram of log10(residual.sugar) is approximately normal with a peak between 0.3 and 0.4. There are suspected outliers on the right, and I should consider whether to exclude them.
ggplot(aes(x=chlorides), data = wine)+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x=(log10(chlorides))), data = wine)+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(wine$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The histogram of log10(chlorides) is approximately normal with a peak between -1.2 and -1.1. There are suspected outliers on both sides, and I should consider whether to exclude them.
ggplot(aes(x=free.sulfur.dioxide), data = wine)+
geom_histogram(bins=50)
ggplot(aes(x=(sqrt(free.sulfur.dioxide))), data = wine)+
geom_histogram(bins=50)
summary(wine$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The histogram of sqrt(free.sulfur.dioxide) is skewed to the right with peaks around 2.5 and 4. There are suspected outliers on the right, and I should consider whether to exclude them.
ggplot(aes(x=total.sulfur.dioxide), data = wine)+
geom_histogram(bins=50)
ggplot(aes(x=(log10(total.sulfur.dioxide))), data = wine)+
geom_histogram(bins=50)
summary(wine$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The histogram of log10(total.sulfur.dioxide) is approximately normal with peaks around 1.5 and 1.75. There are suspected outliers on the right, and I should consider whether to exclude them.
Total sulfur dioxide is made up of free and bound forms. I will create a new variable for bound sulfur dioxide by subtracting free sulfur dioxide from total sulfur dioxide.
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide-wine$free.sulfur.dioxide
ggplot(aes(x=bound.sulfur.dioxide), data = wine)+
geom_histogram(bins=50)
ggplot(aes(x=(log10(bound.sulfur.dioxide))), data = wine)+
geom_histogram(bins=50)
summary(wine$bound.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 21.00 30.59 39.00 251.50
The histogram of log10(bound.sulfur.dioxide) is approximately normal with peaks around 1 and 1.5. There are suspected outliers on the right, and I should consider whether to exclude them.
ggplot(aes(x=density), data = wine)+
geom_histogram(bins=50)
summary(wine$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
table(wine$density)
##
## 0.99007 0.9902 0.99064 0.9908 0.99084 0.9912 0.9915 0.99154 0.99157
## 2 1 2 1 1 1 1 1 1
## 0.9916 0.99162 0.9917 0.99182 0.99191 0.9921 0.9922 0.99235 0.99236
## 2 1 1 2 1 1 2 1 1
## 0.9924 0.99242 0.99252 0.99256 0.99258 0.99264 0.9927 0.9928 0.99286
## 3 2 1 1 3 1 1 2 1
## 0.9929 0.99292 0.99294 0.99306 0.99314 0.99316 0.99318 0.9932 0.99322
## 1 1 2 1 1 2 1 1 1
## 0.99323 0.99328 0.9933 0.99331 0.99332 0.99334 0.99336 0.9934 0.99341
## 1 1 1 2 1 1 1 4 1
## 0.99344 0.99346 0.99348 0.9935 0.99352 0.99354 0.99356 0.99357 0.99358
## 1 3 1 1 2 2 4 1 3
## 0.9936 0.99362 0.99364 0.9937 0.99371 0.99374 0.99376 0.99378 0.99379
## 2 2 1 2 2 2 3 3 1
## 0.9938 0.99384 0.99385 0.99386 0.99387 0.99388 0.99392 0.99394 0.99395
## 1 1 1 1 1 2 2 1 1
## 0.99396 0.99397 0.994 0.99402 0.99408 0.9941 0.99414 0.99416 0.99417
## 3 1 2 4 3 1 2 1 1
## 0.99418 0.99419 0.9942 0.99425 0.99426 0.99428 0.9943 0.99434 0.99437
## 2 2 3 1 1 1 2 1 1
## 0.99438 0.99439 0.9944 0.99444 0.99448 0.99451 0.99454 0.99456 0.99458
## 5 1 3 4 4 1 1 1 4
## 0.99459 0.9946 0.99462 0.99464 0.99467 0.99468 0.9947 0.99471 0.99472
## 1 5 2 2 2 1 6 3 3
## 0.99473 0.99474 0.99476 0.99478 0.99479 0.9948 0.99483 0.99484 0.99486
## 1 1 3 2 1 9 1 3 1
## 0.99488 0.99489 0.9949 0.99491 0.99492 0.99494 0.99495 0.99496 0.99498
## 4 3 4 1 2 4 2 1 5
## 0.99499 0.995 0.99501 0.99502 0.99504 0.99506 0.99508 0.99509 0.9951
## 1 10 1 2 2 1 3 1 4
## 0.99512 0.99514 0.99516 0.99517 0.99518 0.99519 0.9952 0.99521 0.99522
## 2 5 6 1 3 1 9 1 4
## 0.99523 0.99524 0.99525 0.99526 0.99528 0.99529 0.9953 0.99531 0.99532
## 1 4 2 2 3 1 4 2 1
## 0.99533 0.99534 0.99536 0.99538 0.9954 0.99541 0.99542 0.99543 0.99544
## 1 6 2 11 4 1 1 2 1
## 0.99545 0.99546 0.99547 0.99549 0.9955 0.99551 0.99552 0.99553 0.99554
## 3 7 2 2 14 3 5 1 3
## 0.99555 0.99556 0.99557 0.99558 0.9956 0.99562 0.99564 0.99565 0.99566
## 1 2 3 3 14 4 2 3 4
## 0.99568 0.99569 0.9957 0.99572 0.99573 0.99574 0.99575 0.99576 0.99577
## 4 1 6 9 1 2 2 5 3
## 0.99578 0.9958 0.99581 0.99582 0.99584 0.99585 0.99586 0.99587 0.99588
## 3 14 1 1 2 3 6 2 4
## 0.99589 0.9959 0.99592 0.99593 0.99594 0.99596 0.99598 0.99599 0.996
## 1 13 4 2 1 2 2 2 13
## 0.99603 0.99604 0.99605 0.99606 0.99608 0.99609 0.9961 0.99612 0.99613
## 2 3 3 2 2 1 10 6 4
## 0.99614 0.99615 0.99616 0.99617 0.99619 0.9962 0.99621 0.99622 0.99623
## 2 5 7 1 1 28 1 5 2
## 0.99624 0.99625 0.99627 0.99628 0.99629 0.9963 0.99631 0.99632 0.99633
## 3 3 3 3 2 15 1 4 4
## 0.99634 0.99635 0.99636 0.99638 0.99639 0.9964 0.99641 0.99642 0.99643
## 3 1 5 5 2 25 1 3 1
## 0.99645 0.99646 0.99647 0.99648 0.99649 0.9965 0.99651 0.99652 0.99654
## 1 1 2 3 1 11 1 6 2
## 0.99655 0.99656 0.99658 0.99659 0.9966 0.99661 0.99664 0.99665 0.99666
## 6 5 1 2 23 1 3 1 3
## 0.99667 0.99668 0.99669 0.9967 0.99672 0.99674 0.99675 0.99676 0.99677
## 1 4 2 13 5 2 5 3 2
## 0.99678 0.9968 0.99682 0.99683 0.99684 0.99685 0.99686 0.99688 0.99689
## 1 35 2 2 1 8 3 2 4
## 0.9969 0.99692 0.99693 0.99694 0.99695 0.99697 0.99698 0.99699 0.997
## 18 4 2 3 1 1 1 1 24
## 0.99701 0.99702 0.99704 0.99705 0.99706 0.99708 0.99709 0.9971 0.99712
## 2 4 3 1 2 4 1 13 4
## 0.99713 0.99714 0.99716 0.99717 0.99718 0.99719 0.9972 0.99721 0.99722
## 2 2 2 1 3 1 36 1 1
## 0.99724 0.99725 0.99726 0.99727 0.99728 0.99729 0.9973 0.99732 0.99733
## 4 1 1 1 3 1 18 3 1
## 0.99734 0.99735 0.99736 0.99738 0.99739 0.9974 0.99743 0.99744 0.99745
## 4 6 5 4 1 22 2 2 9
## 0.99746 0.99747 0.99748 0.9975 0.99752 0.99754 0.99756 0.99758 0.9976
## 7 2 3 7 1 1 1 1 35
## 0.99761 0.99764 0.99765 0.99768 0.99769 0.9977 0.99772 0.99774 0.99779
## 1 1 1 3 2 4 1 5 1
## 0.9978 0.99782 0.99783 0.99784 0.99785 0.99786 0.99787 0.99788 0.9979
## 26 2 2 1 1 4 3 2 14
## 0.99791 0.99796 0.99798 0.998 0.99801 0.99803 0.99808 0.9981 0.99814
## 1 1 2 29 2 3 1 10 2
## 0.99815 0.99817 0.99818 0.9982 0.99822 0.99823 0.99824 0.99828 0.9983
## 2 2 3 23 1 1 3 2 9
## 0.99832 0.99834 0.99836 0.9984 0.99842 0.99845 0.9985 0.99852 0.99854
## 1 1 2 20 2 1 3 1 1
## 0.99855 0.99859 0.9986 0.99864 0.99865 0.9987 0.99878 0.9988 0.99888
## 2 1 19 1 2 12 1 20 2
## 0.9989 0.99892 0.999 0.99901 0.9991 0.99914 0.99915 0.99918 0.9992
## 2 3 8 1 10 3 1 1 7
## 0.99922 0.99925 0.9993 0.99935 0.99938 0.99939 0.9994 0.9995 0.9996
## 1 1 4 1 1 1 24 1 12
## 0.99965 0.9997 0.99974 0.99975 0.99976 0.9998 0.9999 1 1.00005
## 1 8 1 1 1 10 1 10 2
## 1.0001 1.00012 1.00015 1.0002 1.00024 1.00025 1.0003 1.0004 1.0006
## 4 1 2 10 1 1 2 9 6
## 1.0008 1.001 1.0014 1.0015 1.0018 1.0021 1.0022 1.00242 1.0026
## 3 6 6 2 1 2 2 2 2
## 1.00289 1.00315 1.0032 1.00369
## 1 3 1 2
The histogram of density is approximately normal with a peak around 0.997.
ggplot(aes(x=pH), data = wine)+
geom_histogram(bins=50)
summary(wine$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The histogram of pH is approximately normal with a peak between 3.25 and 3.3.
ggplot(aes(x=sulphates), data = wine)+
geom_histogram(bins=50)
ggplot(aes(x=(log10(sulphates))), data = wine)+
geom_histogram(bins=50)
summary(wine$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The histogram of log10(sulphates) is roughly normal with a tail to the right and a peak around -0.2. There are suspected outliers on the right, and I should consider whether to exclude them.
ggplot(aes(x=alcohol), data = wine)+
geom_histogram(bins=50)
summary(wine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The histogram of alcohol is skewed to the right with a peak around 9.5. There are suspected outliers on the right, and I should consider whether to exclude them.
ggplot(aes(x=quality), data = wine)+
geom_histogram(bins = 6)+
scale_x_continuous(name = "quality", breaks = 3:8)
summary(wine$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
table(wine$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The table above shows that most wines received a quality score of 5 or 6.
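To put a number on "most", the short sketch below recomputes the share of wines rated 5 or 6 from the counts printed by table(wine$quality) above. The counts are copied by hand, so the block runs without the CSV file; the variable names are illustrative.

```r
# Counts copied from the table(wine$quality) output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total <- sum(quality_counts)
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / total

cat("Total wines:", total, "\n")
cat("Share rated 5 or 6:", round(100 * share_5_or_6, 1), "%\n")
```

With the data loaded, prop.table(table(wine$quality)) gives the same shares for every score at once.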
There are 1599 red wines in the dataset and 14 columns: the row index X, the 12 original variables, and the bound sulfur dioxide variable that I created.
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ bound.sulfur.dioxide: num 23 42 39 43 23 27 44 6 9 85 ...
I added the bound sulfur dioxide variable because the bound form might be the one that influences the quality of the wine.
I also transformed several variables to bring their distributions closer to normal. Most of the histograms were skewed to the right, so I used the log10, square root, and cube root functions.
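The effect of these transformations can be checked numerically as well as visually. The sketch below is only an illustration on simulated right-skewed data (a log-normal sample, not the wine data), and the skewness helper is a hypothetical function defined here, not part of any package used in this report.

```r
# Simulated right-skewed data (NOT the wine data): a log-normal sample.
set.seed(42)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.8)

# Moment-based sample skewness; values near 0 indicate symmetry.
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / (mean(centered^2)^1.5)
}

skew_raw  <- skewness(x)
skew_sqrt <- skewness(sqrt(x))
skew_log  <- skewness(log10(x))

print(round(c(raw = skew_raw, sqrt = skew_sqrt, log10 = skew_log), 2))
```

The same helper could be applied to a wine column, for example skewness(log10(wine$chlorides)), to compare candidate transformations before choosing one.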
In the next section, I will explore which variables are best for predicting the quality of red wine.
cor(wine)
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.26848392 -0.008815099
## fixed.acidity -0.268483920 1.00000000 -0.256130895
## volatile.acidity -0.008815099 -0.25613089 1.000000000
## citric.acid -0.153551355 0.67170343 -0.552495685
## residual.sugar -0.031260835 0.11477672 0.001917882
## chlorides -0.119868519 0.09370519 0.061297772
## free.sulfur.dioxide 0.090479643 -0.15379419 -0.010503827
## total.sulfur.dioxide -0.117849669 -0.11318144 0.076470005
## density -0.368372087 0.66804729 0.022026232
## pH 0.136005328 -0.68297819 0.234937294
## sulphates -0.125306999 0.18300566 -0.260986685
## alcohol 0.245122841 -0.06166827 -0.202288027
## quality 0.066452608 0.12405165 -0.390557780
## bound.sulfur.dioxide -0.178263036 -0.07814929 0.097033939
## citric.acid residual.sugar chlorides
## X -0.15355136 -0.031260835 -0.119868519
## fixed.acidity 0.67170343 0.114776724 0.093705186
## volatile.acidity -0.55249568 0.001917882 0.061297772
## citric.acid 1.00000000 0.143577162 0.203822914
## residual.sugar 0.14357716 1.000000000 0.055609535
## chlorides 0.20382291 0.055609535 1.000000000
## free.sulfur.dioxide -0.06097813 0.187048995 0.005562147
## total.sulfur.dioxide 0.03553302 0.203027882 0.047400468
## density 0.36494718 0.355283371 0.200632327
## pH -0.54190414 -0.085652422 -0.265026131
## sulphates 0.31277004 0.005527121 0.371260481
## alcohol 0.10990325 0.042075437 -0.221140545
## quality 0.22637251 0.013731637 -0.128906560
## bound.sulfur.dioxide 0.06677604 0.174529035 0.055479649
## free.sulfur.dioxide total.sulfur.dioxide density
## X 0.090479643 -0.11784967 -0.36837209
## fixed.acidity -0.153794193 -0.11318144 0.66804729
## volatile.acidity -0.010503827 0.07647000 0.02202623
## citric.acid -0.060978129 0.03553302 0.36494718
## residual.sugar 0.187048995 0.20302788 0.35528337
## chlorides 0.005562147 0.04740047 0.20063233
## free.sulfur.dioxide 1.000000000 0.66766645 -0.02194583
## total.sulfur.dioxide 0.667666450 1.00000000 0.07126948
## density -0.021945831 0.07126948 1.00000000
## pH 0.070377499 -0.06649456 -0.34169933
## sulphates 0.051657572 0.04294684 0.14850641
## alcohol -0.069408354 -0.20565394 -0.49617977
## quality -0.050656057 -0.18510029 -0.17491923
## bound.sulfur.dioxide 0.425148917 0.95768634 0.09513464
## pH sulphates alcohol quality
## X 0.13600533 -0.125306999 0.24512284 0.06645261
## fixed.acidity -0.68297819 0.183005664 -0.06166827 0.12405165
## volatile.acidity 0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid -0.54190414 0.312770044 0.10990325 0.22637251
## residual.sugar -0.08565242 0.005527121 0.04207544 0.01373164
## chlorides -0.26502613 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.07037750 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456 0.042946836 -0.20565394 -0.18510029
## density -0.34169933 0.148506412 -0.49617977 -0.17491923
## pH 1.00000000 -0.196647602 0.20563251 -0.05773139
## sulphates -0.19664760 1.000000000 0.09359475 0.25139708
## alcohol 0.20563251 0.093594750 1.00000000 0.47616632
## quality -0.05773139 0.251397079 0.47616632 1.00000000
## bound.sulfur.dioxide -0.10805328 0.032244043 -0.22320257 -0.20546298
## bound.sulfur.dioxide
## X -0.17826304
## fixed.acidity -0.07814929
## volatile.acidity 0.09703394
## citric.acid 0.06677604
## residual.sugar 0.17452903
## chlorides 0.05547965
## free.sulfur.dioxide 0.42514892
## total.sulfur.dioxide 0.95768634
## density 0.09513464
## pH -0.10805328
## sulphates 0.03224404
## alcohol -0.22320257
## quality -0.20546298
## bound.sulfur.dioxide 1.00000000
cor(wine)["quality", ]
## X fixed.acidity volatile.acidity
## 0.06645261 0.12405165 -0.39055778
## citric.acid residual.sugar chlorides
## 0.22637251 0.01373164 -0.12890656
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.05065606 -0.18510029 -0.17491923
## pH sulphates alcohol
## -0.05773139 0.25139708 0.47616632
## quality bound.sulfur.dioxide
## 1.00000000 -0.20546298
From the output above, alcohol (0.48), volatile acidity (-0.39), and sulphates (0.25) have the strongest correlations with red wine quality.
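The ranking can be reproduced directly from the printed correlations. In the sketch below the values are copied by hand from the cor() output above so that the block runs on its own; sorting by absolute value is my choice, so that strong negative correlations are not pushed to the bottom.

```r
# Correlations with quality, copied from the cor() output above
# (the index column X and quality itself are left out).
r_quality <- c(
  fixed.acidity = 0.12405165, volatile.acidity = -0.39055778,
  citric.acid = 0.22637251, residual.sugar = 0.01373164,
  chlorides = -0.12890656, free.sulfur.dioxide = -0.05065606,
  total.sulfur.dioxide = -0.18510029, density = -0.17491923,
  pH = -0.05773139, sulphates = 0.25139708, alcohol = 0.47616632,
  bound.sulfur.dioxide = -0.20546298
)

# Order by absolute value so strong negative correlations rank highly too.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
print(round(ranked, 3))
```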
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
##
## percent
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
ggpairs(wine,
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
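From the plot matrix above, there may be a multicollinearity problem when we look at multivariate relationships: for example, fixed acidity is correlated with citric acid (0.67), density (0.67), and pH (-0.68), and total sulfur dioxide is almost collinear with bound sulfur dioxide (0.96). I will take this into account in the next section on multivariate relationships.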
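A rough way to gauge how serious the collinearity is: when a model contains only two predictors, the variance inflation factor (VIF) reduces to 1 / (1 - r^2). The sketch below applies that formula to a few pairwise correlations copied from the cor(wine) output. The two-predictor simplification and the rule-of-thumb threshold in the comments are assumptions for illustration, not results of this analysis.

```r
# Pairwise correlations copied from the cor(wine) output above.
r_pairs <- c(
  "total.sulfur.dioxide ~ bound.sulfur.dioxide" = 0.95768634,
  "fixed.acidity ~ pH" = -0.68297819,
  "fixed.acidity ~ citric.acid" = 0.67170343,
  "fixed.acidity ~ density" = 0.66804729
)

# With only two predictors, the variance inflation factor reduces to
# 1 / (1 - r^2). A VIF above about 5 to 10 is a common warning sign
# (a rule of thumb, not a threshold from this analysis).
vif_two_predictors <- 1 / (1 - r_pairs^2)
print(round(vif_two_predictors, 2))
```

By this measure only the total and bound sulfur dioxide pair stands out, which is expected because one was derived from the other.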
From the plot matrix above, it is also hard to tell whether there is a linear relationship between each variable and quality, especially because quality behaves more like a categorical variable.
I will look more closely at this through plots and ANOVA tests.
ggplot(aes(x=fixed.acidity, y=quality), data = wine)+
geom_point()
From the scatter plot above, it is hard to see the relationship between red wine quality and fixed acidity, especially since quality takes only a few discrete values. I will use boxplots to examine the relationship, which requires converting quality from numeric to a factor.
wine$quality <- as.factor(wine$quality)
str(wine)
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## $ bound.sulfur.dioxide: num 23 42 39 43 23 27 44 6 9 85 ...
ggplot(aes(y=fixed.acidity, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=fixed.acidity, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)
The boxplots above do not show a strong relationship between quality and fixed acidity.
acid_quality <- aov(fixed.acidity ~ quality, data = wine)
summary(acid_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 94 18.737 6.283 8.79e-06 ***
## Residuals 1593 4751 2.982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p = 8.79e-06), we can reject the null hypothesis that mean fixed acidity is the same across all quality levels. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into this further with a post hoc test.
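For readers who want to see where the F value and p-value come from, the sketch below recomputes both from the mean squares and degrees of freedom printed in the ANOVA table above. The inputs are the rounded values from that output, so the results agree only to the printed precision.

```r
# Values copied from the summary(acid_quality) output above.
ms_between <- 18.737   # Mean Sq for quality
ms_within  <- 2.982    # Mean Sq for Residuals
df_between <- 5
df_within  <- 1593

# The F statistic is the ratio of between-group to within-group mean squares.
f_value <- ms_between / ms_within
# The upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

cat("F =", round(f_value, 3), "\n")
cat("p =", signif(p_value, 3), "\n")
```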
TukeyHSD(acid_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = fixed.acidity ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 -0.58075472 -2.27948640 1.1179770 0.9257629
## 5-3 -0.19274596 -1.76223366 1.3767417 0.9993075
## 6-3 -0.01282132 -1.58307424 1.5574316 1.0000000
## 7-3 0.51236181 -1.08439601 2.1091196 0.9426320
## 8-3 0.20666667 -1.73661257 2.1499459 0.9996570
## 5-4 0.38800876 -0.31462496 1.0906425 0.6148684
## 6-4 0.56793340 -0.13640797 1.2722748 0.1942859
## 7-4 1.09311653 0.33151423 1.8547188 0.0006306
## 8-4 0.78742138 -0.55672768 2.1315705 0.5509949
## 6-5 0.17992465 -0.09155105 0.4514003 0.4080237
## 7-5 0.70510777 0.30806829 1.1021472 0.0000067
## 8-5 0.39941263 -0.77716674 1.5759920 0.9278394
## 7-6 0.52518313 0.12512942 0.9252368 0.0025626
## 8-6 0.21948798 -0.95811196 1.3970879 0.9948930
## 8-7 -0.30569514 -1.51841231 0.9070220 0.9796484
plot(TukeyHSD(acid_quality))
From the plot and table above, the significant differences in fixed acidity are between quality levels 7-4, 7-5, and 7-6.
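Reading significance off the Tukey plot amounts to checking whether each confidence interval excludes zero. The sketch below does that for a handful of rows copied by hand from the TukeyHSD(acid_quality) output; the data frame and its column names are illustrative.

```r
# A few rows copied from the TukeyHSD(acid_quality) output above.
tukey <- data.frame(
  pair = c("7-4", "7-5", "7-6", "6-5", "8-7"),
  lwr  = c(0.33151423, 0.30806829, 0.12512942, -0.09155105, -1.51841231),
  upr  = c(1.8547188, 1.1021472, 0.9252368, 0.4514003, 0.9070220),
  stringsAsFactors = FALSE
)

# A pair differs significantly at the 95% family-wise level when its
# confidence interval does not contain zero.
tukey$significant <- tukey$lwr > 0 | tukey$upr < 0
print(tukey)
```

On the full result, the same check could be written against the matrix returned in TukeyHSD(acid_quality)$quality, using its lwr and upr columns.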
ggplot(aes(y=volatile.acidity, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=volatile.acidity, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)
The boxplots above suggest a relationship between volatile acidity and quality: volatile acidity tends to decrease as the quality of the wine increases.
vol_acid_quality <- aov(volatile.acidity ~ quality, data = wine)
summary(vol_acid_quality )
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 8.22 1.645 60.91 <2e-16 ***
## Residuals 1593 43.01 0.027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p < 2e-16), we can reject the null hypothesis that mean volatile acidity is the same across all quality levels. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into this further with a post hoc test.
TukeyHSD(vol_acid_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = volatile.acidity ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 -0.19053774 -0.35217798 -0.02889749 0.0102247
## 5-3 -0.30745888 -0.45680111 -0.15811665 0.0000001
## 6-3 -0.38701567 -0.53643072 -0.23760063 0.0000000
## 7-3 -0.48058040 -0.63251748 -0.32864332 0.0000000
## 8-3 -0.46116667 -0.64607647 -0.27625687 0.0000000
## 5-4 -0.11692115 -0.18377920 -0.05006310 0.0000099
## 6-4 -0.19647794 -0.26349848 -0.12945740 0.0000000
## 7-4 -0.29004267 -0.36251178 -0.21757355 0.0000000
## 8-4 -0.27062893 -0.39852940 -0.14272846 0.0000000
## 6-5 -0.07955679 -0.10538865 -0.05372493 0.0000000
## 7-5 -0.17312152 -0.21090121 -0.13534183 0.0000000
## 8-5 -0.15370778 -0.26566341 -0.04175215 0.0013080
## 7-6 -0.09356473 -0.13163123 -0.05549822 0.0000000
## 8-6 -0.07415099 -0.18620374 0.03790175 0.4098254
## 8-7 0.01941374 -0.09598053 0.13480800 0.9968509
plot(TukeyHSD(vol_acid_quality))
From the plot and table above, there is a significant difference between every pair of quality levels except 8-6 and 8-7. Given this, I am considering grouping quality into three categories, low (3-4), medium (5-6), and high (7-8), in order to better explain the relationship between volatile acidity and quality.
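The three-level grouping could be built with cut(). The sketch below is one possible way to do it, run on a quality vector rebuilt from the counts in table(wine$quality) so that it does not need the CSV file; the name quality_group is hypothetical.

```r
# Rebuild a quality vector from the counts in table(wine$quality) above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Bin into three ordered groups: (2,4] = low, (4,6] = medium, (6,8] = high.
quality_group <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("low", "medium", "high"),
                     ordered_result = TRUE)

print(table(quality_group))
```

Because wine$quality is a factor at this point in the analysis, it would need to be converted back with as.integer(as.character(wine$quality)) before being passed to cut().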
ggplot(aes(y=citric.acid, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=citric.acid, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)
The boxplots above suggest a relationship between citric acid and quality: citric acid tends to increase as the quality of the wine increases.
cit_acid_quality <- aov(citric.acid ~ quality, data = wine)
summary(cit_acid_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 3.53 0.7059 19.69 <2e-16 ***
## Residuals 1593 57.11 0.0359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p < 2e-16), we can reject the null hypothesis that mean citric acid is the same across all quality levels. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into this further with a post hoc test.
TukeyHSD(cit_acid_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = citric.acid ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 0.003150943 -0.1831058063 0.18940769 1.0000000
## 5-3 0.072685756 -0.0994000884 0.24477160 0.8345084
## 6-3 0.102824451 -0.0693452965 0.27499420 0.5292715
## 7-3 0.204175879 0.0291000127 0.37925175 0.0115446
## 8-3 0.220111111 0.0070410437 0.43318118 0.0381644
## 5-4 0.069534813 -0.0075051774 0.14657480 0.1039655
## 6-4 0.099673508 0.0224462830 0.17690073 0.0032561
## 7-4 0.201024936 0.1175193597 0.28453051 0.0000000
## 8-4 0.216960168 0.0695814861 0.36433885 0.0004036
## 6-5 0.030138695 0.0003728525 0.05990454 0.0451915
## 7-5 0.131490123 0.0879568901 0.17502336 0.0000000
## 8-5 0.147425355 0.0184197856 0.27643092 0.0144221
## 7-6 0.101351428 0.0574877008 0.14521516 0.0000000
## 8-6 0.117286660 -0.0118308104 0.24640413 0.0998116
## 8-7 0.015935232 -0.1170326519 0.14890312 0.9993852
plot(TukeyHSD(cit_acid_quality))
From the plot and table above, the significant differences in citric acid are between quality levels 7-3, 8-3, 6-4, 7-4, 8-4, 6-5, 7-5, 8-5, and 7-6.
ggplot(aes(y=residual.sugar, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=residual.sugar, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)+
coord_cartesian(ylim = c(1,5))
The boxplots above do not show a strong relationship between quality and residual sugar.
sug_quality <- aov(residual.sugar ~ quality, data = wine)
summary(sug_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 10 2.094 1.053 0.385
## Residuals 1593 3166 1.988
From the ANOVA table (p = 0.385), we cannot reject the null hypothesis that mean residual sugar is the same across all quality levels.
ggplot(aes(y=chlorides, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=chlorides, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)+
coord_cartesian(ylim = c(0,0.2))
The boxplots above do not show a strong relationship between quality and chlorides.
chl_acid_quality <- aov(chlorides~ quality, data = wine)
summary(chl_acid_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.066 0.013162 6.036 1.53e-05 ***
## Residuals 1593 3.474 0.002181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p = 1.53e-05), we can reject the null hypothesis that mean chlorides is the same across all quality levels, even though the boxplots looked flat. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into this further with a post hoc test.
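A significant F test with flat-looking boxplots is not a contradiction: with 1599 observations, even a small effect can reach significance. One way to quantify the size of the effect is eta squared, the share of total variance explained by the grouping. This measure is my addition and is not used elsewhere in the analysis; the sketch below computes it from the rounded sums of squares in the ANOVA table above.

```r
# Sums of squares copied from the summary(chl_acid_quality) output above.
ss_quality   <- 0.066
ss_residuals <- 3.474

# Eta squared: share of total variance in chlorides explained by quality.
eta_squared <- ss_quality / (ss_quality + ss_residuals)
cat("eta squared =", round(eta_squared, 4), "\n")
```

By this measure, quality accounts for roughly 2% of the variance in chlorides, which is consistent with the weak pattern in the boxplots.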
TukeyHSD(chl_acid_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = chlorides ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 -0.031820755 -0.07775835 0.0141168441 0.3563775
## 5-3 -0.029764317 -0.07220686 0.0126782279 0.3421496
## 6-3 -0.037543887 -0.08000713 0.0049193515 0.1180933
## 7-3 -0.045912060 -0.08909205 -0.0027320685 0.0295304
## 8-3 -0.054055556 -0.10660628 -0.0015048304 0.0395900
## 5-4 0.002056438 -0.01694439 0.0210572639 0.9996262
## 6-4 -0.005723132 -0.02477014 0.0133238728 0.9563871
## 7-4 -0.014091306 -0.03468678 0.0065041663 0.3707314
## 8-4 -0.022234801 -0.05858367 0.0141140711 0.5018527
## 6-5 -0.007779570 -0.01512090 -0.0004382449 0.0304543
## 7-5 -0.016147743 -0.02688460 -0.0054108855 0.0002720
## 8-5 -0.024291238 -0.05610864 0.0075261647 0.2484623
## 7-6 -0.008368173 -0.01918654 0.0024501961 0.2349638
## 8-6 -0.016511668 -0.04835667 0.0153333334 0.6775878
## 8-7 -0.008143495 -0.04093815 0.0246511567 0.9809645
plot(TukeyHSD(chl_acid_quality))
The plot above shows significant differences for the pairs 7-3, 8-3, 6-5, and 7-5: their confidence intervals exclude zero and their adjusted p-values are below 0.05.
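Reading the significant pairs off the plot is easy to get wrong, so they can also be filtered from the `TukeyHSD` result directly. The helper below is a sketch, not part of the original analysis: `significant_pairs` is a hypothetical name, and the built-in `iris` data stands in for the wine data so that the block runs on its own.

```r
# Return only the pairwise comparisons whose adjusted p-value is below alpha.
# `fit` is an aov object with a single factor named `factor_name`.
significant_pairs <- function(fit, factor_name, alpha = 0.05) {
  tukey <- TukeyHSD(fit, conf.level = 1 - alpha)[[factor_name]]
  tukey[tukey[, "p adj"] < alpha, , drop = FALSE]
}

# Stand-in example on the built-in iris data (not the wine data)
iris_fit <- aov(Sepal.Length ~ Species, data = iris)
sig <- significant_pairs(iris_fit, "Species")
print(sig)

# A pair is significant exactly when its interval excludes zero
stopifnot(all(sig[, "lwr"] > 0 | sig[, "upr"] < 0))
```

With the objects in this report, the call would be `significant_pairs(chl_acid_quality, "quality")`.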
ggplot(aes(y=free.sulfur.dioxide, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=free.sulfur.dioxide, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)
The boxplots above do not show a strong relationship between quality and free sulfur dioxide.
free_sul_quality <- aov(free.sulfur.dioxide~ quality, data = wine)
summary(free_sul_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 2571 514.1 4.754 0.000257 ***
## Residuals 1593 172274 108.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p = 0.000257), we can reject the null hypothesis that mean free sulfur dioxide is the same across quality groups. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into that with a post hoc test.
TukeyHSD(free_sul_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = free.sulfur.dioxide ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 1.2641509 -8.9655860 11.4938879 0.9992862
## 5-3 5.9838473 -3.4675842 15.4352788 0.4618281
## 6-3 4.7115987 -4.7444410 14.1676385 0.7138656
## 7-3 3.0452261 -6.5704257 12.6608780 0.9456583
## 8-3 2.2777778 -9.4246209 13.9801764 0.9937451
## 5-4 4.7196963 0.4884466 8.9509461 0.0185784
## 6-4 3.4474478 -0.7940854 7.6889810 0.1868980
## 7-4 1.7810752 -2.8052825 6.3674329 0.8782125
## 8-4 1.0136268 -7.0808188 9.1080725 0.9992387
## 6-5 -1.2722485 -2.9070711 0.3625740 0.2288173
## 7-5 -2.9386212 -5.3295870 -0.5476553 0.0061996
## 8-5 -3.7060695 -10.7914129 3.3792739 0.6692481
## 7-6 -1.6663726 -4.0754901 0.7427448 0.3580539
## 8-6 -2.4338210 -9.5253103 4.6576683 0.9246011
## 8-7 -0.7674484 -8.0704130 6.5355163 0.9996765
plot(TukeyHSD(free_sul_quality))
The plot above shows significant differences for the pairs 5-4 and 7-5.
ggplot(aes(y=bound.sulfur.dioxide, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=bound.sulfur.dioxide, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)+
coord_cartesian(ylim = c(0,125))
The boxplots above do not show a strong relationship between quality and bound sulfur dioxide.
bound_sul_quality <- aov(bound.sulfur.dioxide~ quality, data = wine)
summary(bound_sul_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 98706 19741 29.36 <2e-16 ***
## Residuals 1593 1071097 672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p < 2e-16), we can reject the null hypothesis that mean bound sulfur dioxide is the same across quality groups. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into that with a post hoc test.
TukeyHSD(bound_sul_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = bound.sulfur.dioxide ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 10.0811321 -15.426418 35.588682 0.8699705
## 5-3 25.6301028 2.063234 49.196971 0.0238790
## 6-3 11.2583072 -12.320052 34.836666 0.7496028
## 7-3 7.0748744 -16.901472 31.051221 0.9596104
## 8-3 6.2666667 -22.912922 35.446256 0.9901376
## 5-4 15.5489707 4.998473 26.099468 0.0003955
## 6-4 1.1771751 -9.398964 11.753314 0.9995713
## 7-4 -3.0062577 -14.442207 8.429691 0.9754993
## 8-4 -3.8144654 -23.997729 16.368798 0.9945496
## 6-5 -14.3717956 -18.448178 -10.295413 0.0000000
## 7-5 -18.5552284 -24.517032 -12.593425 0.0000000
## 8-5 -19.3634361 -37.030533 -1.696339 0.0221440
## 7-6 -4.1834328 -10.190497 1.823631 0.3501748
## 8-6 -4.9916405 -22.674062 12.690781 0.9665845
## 8-7 -0.8082077 -19.017937 17.401521 0.9999955
plot(TukeyHSD(bound_sul_quality))
The plot above shows significant differences for the pairs 5-3, 5-4, 6-5, 7-5, and 8-5.
ggplot(aes(y=total.sulfur.dioxide, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=total.sulfur.dioxide, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)+
coord_cartesian(ylim = c(0,160))
The boxplots above do not show a strong relationship between quality and total sulfur dioxide.
total_sul_quality <- aov(total.sulfur.dioxide~ quality, data = wine)
summary(total_sul_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 128045 25609 25.48 <2e-16 ***
## Residuals 1593 1601155 1005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p < 2e-16), we can reject the null hypothesis that mean total sulfur dioxide is the same across quality groups. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into that with a post hoc test.
TukeyHSD(total_sul_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = total.sulfur.dioxide ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 11.345283 -19.841526 42.532092 0.9051128
## 5-3 31.613950 2.799916 60.427984 0.0219162
## 6-3 15.969906 -12.858177 44.797989 0.6115609
## 7-3 10.120101 -19.194583 39.434784 0.9228115
## 8-3 8.544444 -27.131983 44.220872 0.9838108
## 5-4 20.268667 7.369100 33.168234 0.0001149
## 6-4 4.624623 -8.306295 17.555541 0.9112220
## 7-4 -1.225183 -15.207347 12.756982 0.9998676
## 8-4 -2.800839 -27.477908 21.876231 0.9995284
## 6-5 -15.644044 -20.628033 -10.660055 0.0000000
## 7-5 -21.493850 -28.783049 -14.204650 0.0000000
## 8-5 -23.069506 -44.670183 -1.468828 0.0283464
## 7-6 -5.849805 -13.194343 1.494732 0.2059503
## 8-6 -7.425462 -29.044876 14.193953 0.9243726
## 8-7 -1.575656 -23.839783 20.688471 0.9999539
plot(TukeyHSD(total_sul_quality))
The plot above shows significant differences for the pairs 5-3, 5-4, 6-5, 7-5, and 8-5.
ggplot(aes(y=density, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=density, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)
The boxplots above do not show a strong relationship between quality and density.
den_quality <- aov(density~ quality, data = wine)
summary(den_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.000230 4.594e-05 13.4 8.12e-13 ***
## Residuals 1593 0.005462 3.430e-06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p = 8.12e-13), we can reject the null hypothesis that mean density is the same across quality groups. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into that with a post hoc test.
TukeyHSD(den_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = density ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 -9.215472e-04 -0.0027431246 9.000302e-04 0.7003175
## 5-3 -3.603730e-04 -0.0020433600 1.322614e-03 0.9902708
## 6-3 -8.489373e-04 -0.0025327449 8.348703e-04 0.7033996
## 7-3 -1.359729e-03 -0.0030719578 3.525005e-04 0.2088099
## 8-3 -2.251778e-03 -0.0043355875 -1.679681e-04 0.0253891
## 5-4 5.611742e-04 -0.0001922713 1.314620e-03 0.2747470
## 6-4 7.260987e-05 -0.0006826668 8.278865e-04 0.9997910
## 7-4 -4.381815e-04 -0.0012548599 3.784970e-04 0.6443084
## 8-4 -1.330231e-03 -0.0027715834 1.111221e-04 0.0899646
## 6-5 -4.885643e-04 -0.0007796721 -1.974566e-04 0.0000271
## 7-5 -9.993557e-04 -0.0014251075 -5.736038e-04 0.0000000
## 8-5 -1.891405e-03 -0.0031530698 -6.297397e-04 0.0002889
## 7-6 -5.107913e-04 -0.0009397754 -8.180729e-05 0.0090996
## 8-6 -1.402840e-03 -0.0026655999 -1.400810e-04 0.0193569
## 8-7 -8.920491e-04 -0.0021924653 4.083671e-04 0.3677080
plot(TukeyHSD(den_quality))
The plot above shows significant differences for the pairs 8-3, 6-5, 7-5, 8-5, 7-6, and 8-6.
ggplot(aes(y=pH, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=pH, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)
The boxplots above suggest that pH decreases as quality increases. However, we should not conclude anything yet.
ph_quality <- aov(pH~ quality, data = wine)
summary(ph_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 0.51 0.10242 4.342 0.000628 ***
## Residuals 1593 37.58 0.02359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p = 0.000628), we can reject the null hypothesis that mean pH is the same across quality groups. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into that with a post hoc test.
TukeyHSD(ph_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = pH ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 -0.01649057 -0.16757254 0.1345914093 0.9996104
## 5-3 -0.09305140 -0.23263865 0.0465358649 0.4012091
## 6-3 -0.07992790 -0.21958322 0.0597274183 0.5767071
## 7-3 -0.10724623 -0.24925884 0.0347663821 0.2599190
## 8-3 -0.13077778 -0.30360935 0.0420537932 0.2578317
## 5-4 -0.07656083 -0.13905174 -0.0140699183 0.0064502
## 6-4 -0.06343733 -0.12608012 -0.0007945477 0.0451336
## 7-4 -0.09075567 -0.15849113 -0.0230202009 0.0019007
## 8-4 -0.11428721 -0.23383328 0.0052588577 0.0704301
## 6-5 0.01312350 -0.01102104 0.0372680287 0.6312170
## 7-5 -0.01419484 -0.04950677 0.0211171021 0.8615725
## 8-5 -0.03772638 -0.14236912 0.0669163551 0.9083845
## 7-6 -0.02731833 -0.06289835 0.0082616867 0.2425756
## 8-6 -0.05084988 -0.15558338 0.0538836280 0.7359924
## 8-7 -0.02353155 -0.13138831 0.0843252185 0.9893982
plot(TukeyHSD(ph_quality))
The plot above shows significant differences for the pairs 5-4, 6-4, and 7-4.
ggplot(aes(y=sulphates, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=sulphates, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)+
coord_cartesian(ylim = c(0,1.1))
The boxplots above suggest that the sulphates level increases as quality increases. However, we should not conclude anything yet.
sulp_quality <- aov(sulphates~ quality, data = wine)
summary(sulp_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 3.00 0.6000 22.27 <2e-16 ***
## Residuals 1593 42.91 0.0269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p < 2e-16), we can reject the null hypothesis that mean sulphates is the same across quality groups. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into that with a post hoc test.
TukeyHSD(sulp_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = sulphates ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 0.02641509 -0.13504180 0.18787198 0.9972425
## 5-3 0.05096916 -0.09820366 0.20014199 0.9259342
## 6-3 0.10532915 -0.04391640 0.25457471 0.3348774
## 7-3 0.17125628 0.01949155 0.32302101 0.0164864
## 8-3 0.19777778 0.01307773 0.38247782 0.0276634
## 5-4 0.02455407 -0.04222814 0.09133628 0.9011170
## 6-4 0.07891406 0.01196955 0.14585857 0.0102225
## 7-4 0.14484119 0.07245428 0.21722810 0.0000002
## 8-4 0.17136268 0.04360729 0.29911807 0.0018695
## 6-5 0.05435999 0.02855743 0.08016255 0.0000000
## 7-5 0.12028712 0.08255028 0.15802395 0.0000000
## 8-5 0.14680861 0.03497998 0.25863725 0.0025621
## 7-6 0.06592713 0.02790380 0.10395045 0.0000123
## 8-6 0.09244862 -0.01947701 0.20437426 0.1723998
## 8-7 0.02652150 -0.08874188 0.14178487 0.9864895
plot(TukeyHSD(sulp_quality))
The plot above shows significant differences for the pairs 7-3, 8-3, 6-4, 7-4, 8-4, 6-5, 7-5, 8-5, and 7-6.
ggplot(aes(y=alcohol, x=quality), data = wine)+
geom_boxplot()
ggplot(aes(y=alcohol, x=quality), data = wine)+
geom_boxplot(outlier.shape = NA)
The boxplots above suggest that the alcohol level increases as quality increases. However, we should not conclude anything yet.
alc_quality <- aov(alcohol~ quality, data = wine)
summary(alc_quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 5 483.9 96.79 115.9 <2e-16 ***
## Residuals 1593 1330.8 0.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table (p < 2e-16), we can reject the null hypothesis that mean alcohol is the same across quality groups. However, the ANOVA table does not tell us which quality groups differ from each other, so I will look into that with a post hoc test.
TukeyHSD(alc_quality)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = alcohol ~ quality, data = wine)
##
## $quality
## diff lwr upr p adj
## 4-3 0.31009434 -0.589020145 1.209208824 0.9231095
## 5-3 -0.05529369 -0.886001167 0.775413796 0.9999660
## 6-3 0.67451933 -0.156593176 1.505631838 0.1882542
## 7-3 1.51091290 0.665771726 2.356054069 0.0000056
## 8-3 2.13944444 1.110894424 3.167994465 0.0000001
## 5-4 -0.36538803 -0.737282044 0.006505993 0.0574326
## 6-4 0.36442499 -0.008372862 0.737222846 0.0597032
## 7-4 1.20081856 0.797713311 1.603923806 0.0000000
## 8-4 1.82935010 1.117911150 2.540789059 0.0000000
## 6-5 0.72981302 0.586124800 0.873501234 0.0000000
## 7-5 1.56620658 1.356059244 1.776353923 0.0000000
## 8-5 2.19473813 1.571991432 2.817484828 0.0000000
## 7-6 0.83639357 0.624650838 1.048136295 0.0000000
## 8-6 1.46492511 0.841638238 2.088211988 0.0000000
## 8-7 0.62853155 -0.013342374 1.270405467 0.0589299
plot(TukeyHSD(alc_quality))
The plot above shows significant differences for the pairs 7-3, 8-3, 7-4, 8-4, 6-5, 7-5, 8-5, 7-6, and 8-6.
F-values:
- Quality x Fixed Acidity: 6.283
- Quality x Volatile Acidity: 60.91
- Quality x Citric Acid: 19.69
- Quality x Residual Sugar: 1.053
- Quality x Chlorides: 6.036
- Quality x Free Sulfur Dioxide: 4.754
- Quality x Bound Sulfur Dioxide: 29.36
- Quality x Total Sulfur Dioxide: 25.48
- Quality x Density: 13.4
- Quality x pH: 4.342
- Quality x Sulphates: 22.27
- Quality x Alcohol: 115.9
Only residual sugar has an F-value too low to reject the null hypothesis. Alcohol, volatile acidity, bound sulfur dioxide, sulphates, and citric acid had high F-values, so I will use these variables to investigate the relationship further.
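The F-values listed above were read from the individual ANOVA tables one at a time. A loop can collect them in one step, as in the sketch below. The helper name `f_values` is hypothetical, and the built-in `iris` data stands in for the wine data so that the block is self-contained.

```r
# Collect the one-way ANOVA F value of each response variable against one factor.
f_values <- function(data, responses, factor_name) {
  sapply(responses, function(v) {
    fit <- aov(reformulate(factor_name, response = v), data = data)
    # Row 1 of the summary table is the factor, row 2 is Residuals
    summary(fit)[[1]][["F value"]][1]
  })
}

# Stand-in example on the built-in iris data (not the wine data)
iris_f <- f_values(iris,
                   c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                   "Species")
sort(round(iris_f, 2), decreasing = TRUE)
```

For this report the call would look like `f_values(wine, c("residual.sugar", "chlorides", "alcohol"), "quality")`, assuming `quality` is stored as a factor, as the 5 degrees of freedom in the tables above suggest.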
From the bivariate plots, I noticed that it would be better to reorganize quality into three categories: low, medium, and high.
wine$quality_3 <- cut(as.numeric(as.character(wine$quality)), c(2,4,6,9), labels=c("low", "medium", "high"))
library(reshape2)
ggplot(aes(x=alcohol,color = quality_3, y=(volatile.acidity^(1/3))), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=alcohol, y=(volatile.acidity^(1/3))) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
From the graphs it is hard to notice any relationship. In addition to the low, medium, and high split, I will also create a variable that divides quality into low (3-5) and high (6-8).
wine$quality_2 <- cut(as.numeric(as.character(wine$quality)), c(2,5,9), labels=c("low", "high"))
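The two `cut()` calls above rely on the default right-closed intervals, so a score that equals a break falls into the lower bin. The sketch below applies the same breaks to the scores 3 to 8 to make the mapping explicit; `scores`, `three_level`, and `two_level` are illustrative names.

```r
scores <- 3:8   # the quality scores compared in the Tukey tables

# Breaks c(2, 4, 6, 9) give the intervals (2,4], (4,6], (6,9]
three_level <- cut(scores, c(2, 4, 6, 9), labels = c("low", "medium", "high"))

# Breaks c(2, 5, 9) give the intervals (2,5], (5,9]
two_level <- cut(scores, c(2, 5, 9), labels = c("low", "high"))

data.frame(quality = scores, quality_3 = three_level, quality_2 = two_level)
```

So `quality_3` maps 3-4 to low, 5-6 to medium, and 7-8 to high, while `quality_2` maps 3-5 to low and 6-8 to high.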
ggplot(aes(x=alcohol,color = quality_2, y=(volatile.acidity^(1/3))), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=alcohol, y=(volatile.acidity^(1/3))) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
Even after dividing quality into two categories, there does not seem to be a relationship. I will further explore with other variables.
ggplot(aes(x=alcohol,color = quality_3, y=log10(bound.sulfur.dioxide)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=alcohol, y=log10(bound.sulfur.dioxide)) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
ggplot(aes(x=alcohol,color = quality_2, y=log10(bound.sulfur.dioxide)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=alcohol, y=log10(bound.sulfur.dioxide)) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
From the graphs it is hard to notice any relationship. The only thing I could notice is that the alcohol level of low quality red wine is less dispersed than that of high quality red wine. I will further explore with other variables.
ggplot(aes(x=alcohol,color = quality_3, y=log10(sulphates)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=alcohol, y=log10(sulphates)) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
ggplot(aes(x=alcohol,color = quality_2, y=log10(sulphates)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=alcohol, y=log10(sulphates)) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
From the graphs it is hard to notice any relationship. The only thing I could notice is that wines with a low alcohol level are more dispersed in sulphates than wines with a high alcohol level. I will further explore with other variables.
ggplot(aes(x=alcohol,color = quality_3, y=(citric.acid)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=alcohol, y=(citric.acid)) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
ggplot(aes(x=alcohol,color = quality_2, y=(citric.acid)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=alcohol, y=(citric.acid)) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
From the graphs it is hard to notice any relationship. I will further explore with other variables.
ggplot(aes(x=alcohol,color = quality_3, y=(density)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=alcohol, y=(density)) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
ggplot(aes(x=alcohol,color = quality_2, y=(density)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=alcohol, y=(density)) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
From the graphs it is hard to notice any relationship, except that high quality wine is more dispersed in alcohol level. I will further explore with other variables.
ggplot(aes(x=(volatile.acidity)^(1/3),color = quality_3, y=log10(bound.sulfur.dioxide)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=(volatile.acidity)^(1/3), y=log10(bound.sulfur.dioxide)) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
ggplot(aes(x=(volatile.acidity)^(1/3),color = quality_2, y=log10(bound.sulfur.dioxide)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=(volatile.acidity)^(1/3), y=log10(bound.sulfur.dioxide)) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
From the graphs, it is hard to notice any relationship. I will further explore with other variables.
ggplot(aes(x=(volatile.acidity)^(1/3),color = quality_3, y=log10(sulphates)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=(volatile.acidity)^(1/3), y=log10(sulphates)) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
ggplot(aes(x=(volatile.acidity)^(1/3),color = quality_2, y=log10(sulphates)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=(volatile.acidity)^(1/3), y=log10(sulphates)) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
From the graphs, it is hard to notice a clear relationship, but high quality wine tends to have higher sulphates and lower volatile acidity than low quality wine. I will also add the alcohol variable to the graph to observe the relationship, dividing alcohol into 5 categories for plotting.
wine$alcohol_lev <- cut(wine$alcohol, breaks = 5, labels = c("very low", "low", "medium", "high", "very high"))
ggplot(aes(x=(volatile.acidity)^(1/3), color = alcohol_lev, y=log10(sulphates)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)+
facet_wrap(~quality_2)
From the plot above, we can see that high quality wine tends to have a higher alcohol level, a higher sulphate level, and lower volatile acidity than low quality wine.
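A note on the alcohol binning used for this plot: `cut(x, breaks = 5)` splits the range of `x` into five equal-width intervals, so the bins need not contain similar numbers of wines. The sketch below illustrates this on hypothetical right-skewed values (not the wine data) and compares it with quantile-based breaks. It does not claim that the wine bins are unbalanced; `table(wine$alcohol_lev)` would show the actual counts.

```r
set.seed(1)
# Hypothetical right-skewed values standing in for wine$alcohol
x <- 8 + rexp(200, rate = 0.5)
level_names <- c("very low", "low", "medium", "high", "very high")

# Equal-width bins, as in the report: counts per bin can differ a lot
bins <- cut(x, breaks = 5, labels = level_names)
table(bins)

# For comparison, quantile-based breaks give equal counts per bin
q_bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.2)),
              include.lowest = TRUE, labels = level_names)
table(q_bins)
```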
I will further explore with other variables.
ggplot(aes(x=(volatile.acidity)^(1/3),color = quality_3, y=(citric.acid)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=(volatile.acidity)^(1/3), y=(citric.acid)) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
ggplot(aes(x=(volatile.acidity)^(1/3),color = quality_2, y=(citric.acid)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=(volatile.acidity)^(1/3), y=(citric.acid)) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
From the graphs, it is hard to notice any relationship. I will further explore with other variables.
ggplot(aes(x=log10(bound.sulfur.dioxide),color = quality_3, y=log10(sulphates)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=log10(bound.sulfur.dioxide), y=log10(sulphates)) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
ggplot(aes(x=log10(bound.sulfur.dioxide),color = quality_2, y=log10(sulphates)), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=log10(bound.sulfur.dioxide), y=log10(sulphates)) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
From the graphs, it is hard to notice any relationship. I will further explore with other variables.
ggplot(aes(x=log10(bound.sulfur.dioxide),color = quality_3, y=citric.acid), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=log10(bound.sulfur.dioxide), y=citric.acid) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
ggplot(aes(x=log10(bound.sulfur.dioxide),color = quality_2, y=citric.acid), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=log10(bound.sulfur.dioxide), y=citric.acid) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
From the graphs, it is hard to notice any relationship. I will further explore with other variables.
ggplot(aes(x=log10(sulphates),color = quality_3, y=citric.acid), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=log10(sulphates), y=citric.acid) , data = wine)+
geom_point()+
facet_wrap(~quality_3)
ggplot(aes(x=log10(sulphates),color = quality_2, y=citric.acid), data = wine)+
geom_point()+
scale_color_brewer(type = 'div', palette = 2)
ggplot(aes(x=log10(sulphates), y=citric.acid) , data = wine)+
geom_point()+
facet_wrap(~quality_2)
From the graphs, it is hard to notice a clear relationship, but high quality wine tends to have higher sulphates than low quality wine.
For most of the plots, it was hard to identify a strong relationship among the variables. As shown in the bivariate analysis, alcohol seemed to have more influence on the quality of wine than the other variables.
ggplot(aes(y=alcohol, x=quality, color=quality), data = wine)+
geom_boxplot()+
scale_y_continuous(name = "Alcohol Level")+
scale_x_discrete(name = "Quality of Red Wine")+
ggtitle("Boxplot of alcohol level by quality of red wines")+
scale_color_brewer(name= "Quality", palette="Set2")
The plot indicates that high quality wines tend to have a higher alcohol level than low quality wines.
ggplot(aes(x=(volatile.acidity)^(1/3),color = quality_2, y=log10(sulphates)), data = wine)+
geom_point()+
scale_y_continuous(name = "Sulphates (log10)")+
scale_x_continuous(name = "Volatile Acidity (cube root)")+
scale_color_brewer(name = "Quality", type = 'div', palette = 2)+
ggtitle("Scatter plot of sulphates by volatile acidity and quality of red wine")
Although it is not evident, we can see that high quality wine tends to have higher sulphates and lower volatile acidity than low quality wine. However, it would be hard to predict the quality of a red wine from its sulphates and volatile acidity alone because of the high variance.
ggplot(aes(x=(volatile.acidity)^(1/3), color = alcohol_lev, y=log10(sulphates)), data = wine)+
geom_point()+
scale_y_continuous(name = "Sulphates (log10)")+
scale_x_continuous(name = "Volatile Acidity (cube root)")+
scale_color_brewer(name = "Alcohol Level", type = 'div', palette = 2)+
facet_wrap(~quality_2)+
ggtitle("Scatter plot of sulphates by volatile acidity, alcohol, and quality of red wine")
From the plot we can see that high quality wine tends to have a higher alcohol level, higher sulphates, and lower volatile acidity than low quality wine. As mentioned before, the relationship is not strong enough to predict the quality of red wine from its alcohol level, sulphates, and volatile acidity.
The red wine data set contains 1599 red wines with 12 variables. I explored each variable's distribution and the bivariate plots to identify relationships between the variables and the quality of red wine. Since the quality of red wine was a categorical variable, I could not use a linear model, so I used boxplots to see the relationships. The difficulty with boxplots was that there was no exact standard for deciding whether the relationship between quality and a variable was strong enough. In particular, with a small number of observations, the variance was too large to identify the relationship. Despite these limitations, I still found alcohol, sulphates, and volatile acidity to have the most influence on the quality of red wine.
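One possible yardstick for "strong enough", which was not used in the analysis above and is offered only as a sketch, is eta squared: the share of a variable's total sum of squares that lies between quality groups. It can be computed from the Sum Sq columns already printed in the ANOVA tables. The numbers below are copied from those tables as rounded, and the tables for fixed acidity, volatile acidity, and citric acid are not repeated here, so those variables are left out.

```r
# Sum Sq values copied from the ANOVA tables in this report (rounded as printed)
ss <- data.frame(
  variable   = c("residual.sugar", "chlorides", "free.sulfur.dioxide",
                 "bound.sulfur.dioxide", "total.sulfur.dioxide", "density",
                 "pH", "sulphates", "alcohol"),
  ss_quality = c(10, 0.066, 2571, 98706, 128045, 0.000230, 0.51, 3.00, 483.9),
  ss_resid   = c(3166, 3.474, 172274, 1071097, 1601155, 0.005462, 37.58, 42.91, 1330.8)
)

# Eta squared: between-group sum of squares divided by the total sum of squares
ss$eta_sq <- ss$ss_quality / (ss$ss_quality + ss$ss_resid)

# Variables ordered from largest to smallest share explained by quality group
ss[order(-ss$eta_sq), c("variable", "eta_sq")]
```

On these rounded figures alcohol has the largest share, about 0.27, which agrees with the ranking by F-value.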
From this project, I realized that the quality of red wine, which is decided by wine experts, is too complex to be explained by the 12 given variables. Next time, I hope to have more variables, such as the price of the wine, and more data to explore.